Add Zenflow code for Stage 1 & 2 #7391
Conversation
Hi @tohtana, thank you for the thoughtful review and suggestions! I tried my best to avoid adding ZenFlow logic directly into the engine and the ZeRO optimizer. But for some shared functions like `average_tensor`, fully separating it would mean rewriting a large function with mostly duplicated code, which might make future maintenance harder when the upstream code changes. I'm happy to improve this further if that is considered better practice; I'm just not entirely sure whether full separation is the right trade-off here.
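For illustration, here is a minimal sketch of one possible decoupling (hypothetical class and hook names, not the actual DeepSpeed code): the shared reduction path stays intact and ZenFlow attaches through an optional callback instead of a duplicated `average_tensor`.

```python
# Hypothetical sketch only: decouple ZenFlow from the shared reduction path
# with an optional hook instead of duplicating average_tensor. Names are
# illustrative and do not match the actual DeepSpeed classes.
class ZeroOptimizerSketch:

    def __init__(self, zenflow_hook=None):
        # zenflow_hook(bucket) is injected only when ZenFlow is enabled.
        self.zenflow_hook = zenflow_hook

    def average_tensor(self, bucket):
        self._reduce_bucket(bucket)       # shared ZeRO reduction logic, unchanged
        if self.zenflow_hook is not None:
            self.zenflow_hook(bucket)     # ZenFlow-specific post-processing

    def _reduce_bucket(self, bucket):
        pass  # placeholder for the existing all-reduce / reduce-scatter code
```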
Hi @Antlera,
- Add ZenFlowCPUAdam and ZenFlowSelectiveAdamW for selective updates - Implement ZenFlowZeroOptimizer and its parallel variant - Support gradient offloading and communication overlap - Implement (un)flatten ops for column-major layout Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
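The column-major (un)flatten mentioned above could look roughly like the sketch below; these are illustrative helpers written against plain PyTorch, not the actual ZenFlow ops.

```python
# Hypothetical sketch of column-major (un)flatten: flattening a 2D parameter
# column by column so per-column slices stay contiguous, which is what
# importance-based column selection needs. Not the actual kernels.
import torch

def flatten_column_major(t: torch.Tensor) -> torch.Tensor:
    # Transpose first so columns become contiguous rows, then flatten.
    return t.t().contiguous().view(-1)

def unflatten_column_major(flat: torch.Tensor, shape) -> torch.Tensor:
    rows, cols = shape
    return flat.view(cols, rows).t().contiguous()

x = torch.arange(6).reshape(2, 3)
assert torch.equal(unflatten_column_major(flatten_column_major(x), x.shape), x)
```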
- Define ZenFlowConfig with support for selective update parameters - Add validation for ZenFlow-related config fields Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
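For reference, a hedged example of what a ZenFlow-enabled config might look like; the key names under `zenflow` are guesses based on the selective-update parameters described in this PR and may not match the merged `ZenFlowConfig` schema.

```python
# Illustrative DeepSpeed config enabling ZenFlow under ZeRO Stage 2 with
# optimizer offload. Key names under "zenflow" are assumptions based on the
# commit descriptions, not a verified schema.
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
        "zenflow": {
            "topk_ratio": 0.1,          # fraction of gradients treated as important
            "select_strategy": "auto",  # or "step" / "epoch"
            "select_interval": "auto",
            "update_interval": "auto",
            "full_warm_up_rounds": 0,
            "overlap_step": True,
        },
    },
}
```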
- Implement ZenFlow configuration and optimizer support in DeepSpeedEngine - Introduce methods for configuring ZenFlow parameters and handling selective updates - Enhance optimizer selection logic to accommodate ZenFlow optimizers - Update step function to manage ZenFlow-specific behaviors during training Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
- Introduce tests to validate the behavior of DeepSpeedZeroConfig with various configurations for ZenFlowConfig, including stage enumeration and offload optimizer settings. - Ensure proper coercion of dictionary inputs into ZenFlowConfig and validate error handling for incorrect types. - Test combined usage of offload_optimizer and zenflow configurations under stage 2. Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
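A rough sketch of the dict-coercion and type-validation checks described above (the `topk_ratio` field name is an assumption; the exact exception type is simplified):

```python
# Sketch of the dict-coercion and type-validation tests described above.
# The "topk_ratio" field name is an assumption; DeepSpeedZeroConfig is the
# pydantic model that owns the zenflow sub-config.
import pytest
from deepspeed.runtime.zero.config import DeepSpeedZeroConfig

def test_zenflow_dict_is_coerced():
    cfg = DeepSpeedZeroConfig(**{"stage": 2, "zenflow": {"topk_ratio": 0.2}})
    assert cfg.zenflow is not None
    assert cfg.zenflow.topk_ratio == 0.2

def test_zenflow_rejects_wrong_type():
    with pytest.raises(Exception):  # ValidationError in practice
        DeepSpeedZeroConfig(**{"stage": 2, "zenflow": 42})
```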
- Fix initialization logic for ZenFlowCPUAdam - Fix gradient update issues in ZenFlowSelectiveAdamW Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Signed-off-by: Yusen Wu <xrn4ub@virginia.edu> Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
- Introduce tests for ZenFlowSelectiveAdamW covering both offload and non-offload modes. - Validate step and group_step behavior with selected index updates and temporary parameter storage. - Ensure correct handling of 1D and 2D parameters, as well as proper gradient/state cleanup after updates. - Verify state increment logic and compatibility with PyTorch's native AdamW for numerical correctness. Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Signed-off-by: Yusen Wu <xrn4ub@virginia.edu> Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
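The selective-update behavior under test can be pictured with the small sketch below: Adam-style math applied only to selected rows of a 2D parameter, with weight decay and the real ZenFlowSelectiveAdamW state handling omitted.

```python
# Toy illustration of a selective update: Adam-style math applied only to the
# selected rows of a 2D parameter; unselected rows keep their values and state.
import torch

def selective_step(param, grad, exp_avg, exp_avg_sq, selected_rows,
                   lr=1e-3, betas=(0.9, 0.999), eps=1e-8, step=1):
    b1, b2 = betas
    g = grad[selected_rows]
    exp_avg[selected_rows] = b1 * exp_avg[selected_rows] + (1 - b1) * g
    exp_avg_sq[selected_rows] = b2 * exp_avg_sq[selected_rows] + (1 - b2) * g * g
    m_hat = exp_avg[selected_rows] / (1 - b1 ** step)
    v_hat = exp_avg_sq[selected_rows] / (1 - b2 ** step)
    param[selected_rows] -= lr * m_hat / (v_hat.sqrt() + eps)

p = torch.zeros(4, 8)
selective_step(p, torch.ones(4, 8), torch.zeros(4, 8), torch.zeros(4, 8),
               selected_rows=torch.tensor([0, 2]))
assert p[1].abs().sum() == 0 and p[0].abs().sum() > 0  # only selected rows moved
```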
- Introduce a new tutorial for ZenFlow, detailing its configuration and usage in DeepSpeed. Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
- Updated methods to accept communication_data_type as a parameter for better handling of IPG buckets. - Removed debug print statements to clean up the code. Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
- Move `_configure_zenflow` logic to a standalone `configure_zenflow()` function in `zenflow_utils.py` - Refactor ZenFlow placement to decouple it from ZeRO internals Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
- Simplify the `_configure_zenflow` method by assigning it a lambda function that calls `configure_zenflow(self)`. - Update the optimizer's selective learning rate synchronization to directly reference `self.optimizer._sync_selective_optimizer_lr()`. Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
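Roughly, the delegation described in the two commits above looks like the following sketch; the import path and surrounding method are assumptions for illustration, while `configure_zenflow` and `_sync_selective_optimizer_lr` are the names taken from the commit messages.

```python
# Sketch of the refactor above: the engine keeps a thin alias while all
# ZenFlow wiring lives in zenflow_utils. The import path is assumed, not
# verified against the merged code.
from deepspeed.runtime.zenflow.zenflow_utils import configure_zenflow  # assumed path

class EngineSketch:

    def __init__(self):
        # Delegate ZenFlow setup to the standalone helper.
        self._configure_zenflow = lambda: configure_zenflow(self)

    def after_step(self):
        # Keep the selective optimizer's learning rate in sync after stepping.
        self.optimizer._sync_selective_optimizer_lr()
```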
- Fixed the invocation of `reduce_gradients` in ZenFlow + ZeRO Stage 1 - Corrected the reduction logic in `extra_large_grad_reduc` to handle gradient aggregation properly - Fixed a bug where ZenFlow could not initialize if the user did not provide a dataset Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
- Implemented single-GPU and distributed tests for ZenFlow with ZeRO Stage 1 and 2 - Covered various configurations of selective optimizer offloading, selection strategies (auto/step/epoch), update intervals, and warm-up rounds - Ensured ZenFlow can initialize and train under different parameter combinations Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
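For context, a minimal single-process sketch of the kind of initialization these tests exercise (the `zenflow` keys are illustrative assumptions, as noted earlier):

```python
# Minimal sketch of initializing a ZenFlow-enabled engine, along the lines of
# the tests described above. The "zenflow" keys are illustrative assumptions.
import deepspeed
import torch

ds_config = {
    "train_batch_size": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
        "zenflow": {"select_strategy": "auto", "full_warm_up_rounds": 0},
    },
}

model = torch.nn.Linear(16, 16)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
```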
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
@sfc-gh-truwase All copyright issues have been fixed.
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Co-authored-by: Guokai Ma <guokai.ma@gmail.com>
@delock do you have additional concerns or can we merge this? Thanks
Since the workflow installs the latest CPU build of torch by default, it pulled 2.8.0+cpu, which caused the version check in tests/conftest.py to fail and exit.
=================================== FAILURES ===================================
______ TestMultipleModels.test_zero_optimizer[True-False-False-2-False-3] ______
[gw3] linux -- Python 3.12.3 /home/runner/work/DeepSpeed/DeepSpeed/unit-test-venv/bin/python
:-1: running the test CRASHED with signal 0
------------------------------- captured stderr --------------------------------
/home/runner/work/DeepSpeed/DeepSpeed/tests/conftest.py:50: _pytest.outcomes.Exit: expected torch version 2.7 did not match found torch version 2.8.0+cpu
___________________ TestNoSyncCtxt.test_zero_stage[0-dtype2] ___________________
[gw0] linux -- Python 3.12.3 /home/runner/work/DeepSpeed/DeepSpeed/unit-test-venv/bin/python
:-1: running the test CRASHED with signal 0
------------------------------- captured stderr --------------------------------
/home/runner/work/DeepSpeed/DeepSpeed/tests/conftest.py:50: _pytest.outcomes.Exit: expected torch version 2.7 did not match found torch version 2.8.0+cpu
______ TestMultipleModels.test_zero_optimizer[True-False-False-2-False-2] ______
[gw1] linux -- Python 3.12.3 /home/runner/work/DeepSpeed/DeepSpeed/unit-test-venv/bin/python
:-1: running the test CRASHED with signal 0
------------------------------- captured stderr --------------------------------
/home/runner/work/DeepSpeed/DeepSpeed/tests/conftest.py:50: _pytest.outcomes.Exit: expected torch version 2.7 did not match found torch version 2.8.0+cpu
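The guard that produces this exit is, in essence, a check of the following shape in tests/conftest.py (simplified sketch; the actual fixture may differ):

```python
# Simplified sketch of the torch-version guard in tests/conftest.py that
# produces the exit above; the real check may compare versions differently.
import pytest
import torch

def enforce_expected_torch(expected="2.7"):
    found = torch.__version__
    if not found.startswith(expected):
        pytest.exit(f"expected torch version {expected} did not match found torch version {found}")
```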
@sfc-gh-truwase Merged to check the new unit tests.
@sfc-gh-truwase Not sure what is causing the new errors in the CI. It says:
Run modal run -m ci.torch_latest
╭─ Error ──────────────────────────────────────────────────────────────────────╮
│ Token missing. Could not authenticate client. If you have token credentials, │
│ see modal.com/docs/reference/modal.config for setup help. If you are a new │
│ user, register an account at modal.com, then run `modal token new`. │
╰──────────────────────────────────────────────────────────────────────────────╯
Error: Process completed with exit code 1.
This might be related to #7289. Possible cause: the CI failures on forked PRs are due to missing Modal authentication (repository secrets are not exposed to workflows triggered from forks).
Merged to check the new CI. Maybe re-running it will solve the problem. I assume this will also bring the branch up to date.
This PR adds a blog post and images for ZenFlow, introducing its design, benefits, and usage. The blog explains how ZenFlow improves GPU utilization by overlapping computation and communication during offloaded training. See also: deepspeedai#7391 – core ZenFlow implementation; [deepspeedai/DeepSpeedExamples#982](deepspeedai/DeepSpeedExamples#982) – benchmarking and fine-tuning example. --------- Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com> Signed-off-by: lym <letusgo126@126.com>
This PR adds ZenFlow, an importance-aware offloaded training framework for DeepSpeed ZeRO. ZenFlow enables multi-step overlap between computation and communication during offloaded training, improving GPU utilization and reducing stalls. Highlights: - New ZenFlow optimizers (ZenFlowCPUAdam, ZenFlowSelectiveAdamW) - ZenFlowZeroOptimizer for ZeRO Stage 1/2 integration - Configurable via ZenFlowConfig, integrated with DeepSpeedZeroConfig - Unit tests and documentation included Note: This PR focuses on Stage 1 and 2 integration. Stage 3 support will be introduced in a follow-up PR. --------- Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu> Signed-off-by: Yusen Wu <xrn4ub@virginia.edu> Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Co-authored-by: Yusen Wu <xrn4ub@virginia.edu> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com> Co-authored-by: Guokai Ma <guokai.ma@gmail.com> Signed-off-by: lym <letusgo126@126.com>
This PR completes the ZenFlow integration for DeepSpeed ZeRO Stage 3. Highlights: - ZenFlowSelectiveAdamW_stage3: Optimizer with importance-aware selective parameter updates for ZeRO Stage 3. - ZenFlowZeroOptimizer_Stage3: Full Stage 3 optimizer integration with partitioned parameters and CPU offload. - Configurable via ZenFlowConfig, fully integrated with DeepSpeedZeroConfig for Stage 3. - Unit tests for Stage 3 cases ensuring correctness and compatibility. Note: Integration with ZeRO Stage 1&2 was introduced in #7391 --------- Signed-off-by: Yusen Wu <xrn4ub@virginia.edu> Co-authored-by: Ma, Guokai <guokai.ma@intel.com> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Co-authored-by: Tingfeng Lan <erc8gx@virginia.edu>